Introduction



This is the final project for PSTAT 131, Intro to Statistical Machine Learning. I chose to use the Ames Housing Data Set from Kaggle, which was compiled by Dean De Cock for use in data science education. Since this project is supervised machine learning and the Kaggle testing data doesn't include the target variable Sale Price, I will use Kaggle's training data as the entire data set and split it into training and testing sets later on.

The goal is to predict the sale price of houses in Ames, Iowa using the 80 variables provided in the data set. It is a regression problem. Let’s begin!

Import Libraries and Load Dataset

Load in Data

library(janitor)   # for clean_names()
Ames <- read.csv("/Users/meilinpan/Desktop/pstat131/FinalProject/HP/ames_housing.csv")
Ames <- clean_names(Ames)
set.seed(2000)

Initial Exploration

dim(Ames)
## [1] 1460   81

The Ames Housing data set contains 81 variables for 1,460 houses. Out of the 81 variables, sale_price is the response variable. This leaves us with 80 predictor variables.

After taking a closer look at each variable, I decided to exclude the id variable since it is meaningless in the setting of this project.

Tidying Raw Data

suppressWarnings({
  library(visdat)
  vis_miss(Ames)
})

There are 19 variables with missing values. Three of them have over 90% of their values missing, and all happen to be categorical:

1. alley is a categorical variable that describes the type of alley access to the property. It has 93.77% of its values missing.

2. pool_qc is a categorical variable that describes the pool quality of houses. It has 99.52% of its values missing.

3. misc_feature, with 96.3% of its values missing, is a categorical variable that describes miscellaneous features not covered in other categories.

From the data description on Kaggle, the ‘NA’ level in the above 3 variables simply means the house does not have that feature. Since most of the houses in the data set don’t have alleys, pools, or miscellaneous features, I will exclude these variables.

By the same logic, the misc_val and pool_area variables will be dropped as well.

Ames <- subset(Ames, select = -c(id, alley, pool_qc, misc_feature, pool_area, misc_val))

Of the 74 remaining variables, 34 are numerical and 40 are categorical. That is still quite a lot, and I will analyze them further in order to select or modify them.


Exploratory Data Analysis


In this section, I will examine the relationships among the predictor variables as well as their relationships with the response variable - sale_price.

Sale Price


Since we are trying to predict the sale price, it would be helpful to look at its distribution.

Ames %>% 
  ggplot(aes(x = sale_price)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = min(Ames$sale_price),color='red',linetype='dashed')+
  scale_x_continuous(labels=dollar_format())+
  labs(title = "Distribution of Sale Price")+
  theme_bw()



From the histogram, the distribution of sale_price is right-skewed, so I will perform a log transformation on it in the model preparation step. Additionally, the bars on the far right of the histogram suggest that a few houses were sold at extremely high prices. They are likely the same houses identified in the previous step - the ones with pools.
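As a quick sanity check on that plan, the effect of a log transform on a right-skewed variable can be sketched in base R with synthetic data (the numbers below are illustrative, not the Ames prices):

```r
# Manual skewness coefficient: positive values indicate a right skew.
skewness <- function(v) mean(((v - mean(v)) / sd(v))^3)

set.seed(2000)
# Synthetic right-skewed "prices", roughly log-normal like housing data
prices <- exp(rnorm(1000, mean = 12, sd = 0.4))

skew_raw <- skewness(prices)          # clearly positive
skew_log <- skewness(log10(prices))   # close to zero
```

After the transform, the distribution is far more symmetric, which is the behavior we want before fitting the models.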

Numerical Variables

Correlation Plot


Let’s visualize a correlation matrix of the numeric variables and also list out the correlation coefficients in descending order:

correlations <- Ames %>% 
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")%>% # handling missing data
  corrplot(type = "upper", diag = FALSE,method = "circle")

correlations <- Ames %>% 
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")
cor_with_sale_price <- correlations %>%
  as.data.frame() %>%
  rownames_to_column(var = "variable") %>%
  select(variable, sale_price) %>%
  arrange(desc(sale_price))
print(cor_with_sale_price)
##           variable  sale_price
## 1       sale_price  1.00000000
## 2     overall_qual  0.79098160
## 3      gr_liv_area  0.70862448
## 4      garage_cars  0.64040920
## 5      garage_area  0.62343144
## 6    total_bsmt_sf  0.61358055
## 7      x1st_flr_sf  0.60585218
## 8        full_bath  0.56066376
## 9  tot_rms_abv_grd  0.53372316
## 10      year_built  0.52289733
## 11  year_remod_add  0.50710097
## 12   garage_yr_blt  0.48636168
## 13    mas_vnr_area  0.47749305
## 14      fireplaces  0.46692884
## 15    bsmt_fin_sf1  0.38641981
## 16    lot_frontage  0.35179910
## 17    wood_deck_sf  0.32441344
## 18     x2nd_flr_sf  0.31933380
## 19   open_porch_sf  0.31585623
## 20       half_bath  0.28410768
## 21        lot_area  0.26384335
## 22  bsmt_full_bath  0.22712223
## 23     bsmt_unf_sf  0.21447911
## 24  bedroom_abv_gr  0.16821315
## 25    screen_porch  0.11144657
## 26         mo_sold  0.04643225
## 27     x3ssn_porch  0.04458367
## 28    bsmt_fin_sf2 -0.01137812
## 29  bsmt_half_bath -0.01684415
## 30 low_qual_fin_sf -0.02560613
## 31         yr_sold -0.02892259
## 32    overall_cond -0.07785589
## 33    ms_sub_class -0.08428414
## 34  enclosed_porch -0.12857796
## 35  kitchen_abv_gr -0.13590737



There are a few takeaways from the plot and table above:

  1. The high positive correlation between sale_price and overall_qual (0.79) suggests that a linear regression model built around overall_qual might be very effective in predicting sale_price. Intuitively it also makes sense: people are willing to pay more for higher-quality housing.

  2. Multicollinearity exists among a few predictors. For each pair with a correlation coefficient over 0.5, I will drop the variable with the lower correlation to sale_price. Between:

    (1) x1st_flr_sf and total_bsmt_sf, dropping x1st_flr_sf
    (2) tot_rms_abv_grd and gr_liv_area, dropping tot_rms_abv_grd
    (3) garage_yr_blt and year_built, dropping garage_yr_blt
    (4) garage_cars and garage_area, dropping garage_area
  3. Variables with poor correlation to the response variable - ms_sub_class, lot_frontage, lot_area, overall_cond, bsmt_fin_sf1, bsmt_fin_sf2, bsmt_unf_sf, x2nd_flr_sf, low_qual_fin_sf, bsmt_full_bath, bsmt_half_bath, half_bath, bedroom_abv_gr, kitchen_abv_gr, wood_deck_sf, open_porch_sf, enclosed_porch, x3ssn_porch, screen_porch, mo_sold, yr_sold, and year_remod_add - will be dropped.
Ames <- subset(Ames, select = -c(x1st_flr_sf, tot_rms_abv_grd, garage_yr_blt, garage_area, ms_sub_class, lot_frontage, lot_area, overall_cond, bsmt_fin_sf1, bsmt_fin_sf2, bsmt_unf_sf, x2nd_flr_sf, low_qual_fin_sf, bsmt_full_bath, bsmt_half_bath, half_bath, bedroom_abv_gr, kitchen_abv_gr, wood_deck_sf, open_porch_sf, enclosed_porch, x3ssn_porch, screen_porch, mo_sold, yr_sold, year_remod_add))



After dropping some irrelevant variables, the correlation plot looks much clearer.

correlations <- Ames %>% 
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")%>% # handling missing data
  corrplot(type = "upper", diag = FALSE,method = "circle", addCoef.col = 1)

Sale Price vs. Highly Correlated Variables


The following plots further examine some of the important features identified above.

p232<-Ames %>% 
  ggplot(aes(x=overall_qual, y=sale_price)) + 
  geom_jitter(width = 0.5, size = 0.5) +
  geom_smooth(method = "lm", se =F, col="red") +
  labs(title = "Sale Price vs. Overall Quality",y = "Sale Price", x= "Overall Quality")
p233<-Ames %>%
  ggplot(aes(x = gr_liv_area, y = sale_price)) +
  geom_point(aes(x = gr_liv_area, y = sale_price), size = 0.5) +
  geom_smooth(method = "lm", col = "blue") +
  theme_minimal() +
  labs(title = "Sale Price vs. Above Ground Living Area Square Feet",
       x = "Above Ground Living Area Square Feet",
       y = "Sale Price")
p234<-Ames %>%
  ggplot(aes(x = as.factor(garage_cars), y = sale_price)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Sale Price vs. Garage Cars", x = "Garage Size (Number of Cars)", y = "Sale Price") +
  theme_minimal()
p235<-Ames %>%
  ggplot(aes(x = year_built, y = sale_price)) +
  geom_point(alpha = 0.6, size = 0.5) +
  labs(title = "Sale Price vs. Year Built", x = "Year Built", y = "Sale Price") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 1))
grid.arrange(p232,p233,p234, p235, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'


  1. There is clearly a positive linear relationship between the overall quality of houses and their sale prices. Houses with higher overall qualities are likely to have higher prices.

  2. Larger above-ground living areas also lead to higher sale prices; however, as the area gets larger, the sale prices become more dispersed. This might be because larger houses vary more from one another, from interior decoration to location.

  3. An interesting finding is that houses with garages that accommodate 3 cars have the highest sale prices of all garage sizes. A possible explanation is that houses in expensive neighborhoods have limited space, and a 3-car garage is the largest the builders can offer.

  4. In general, houses built more recently sold for higher prices, with the exception of a few older houses with high prices. Also, the range of sale prices differs across the years: more recently built houses vary more in price.

Lastly, a few outliers are spotted and will be handled later.

Categorical Variables

I will closely analyze the distribution and significance of the categorical variables here.

ANOVA



Since we have over 1,000 observations, I will assume the population is approximately normally distributed. I will run an ANOVA at the 1% significance level on each categorical variable to test whether the mean of sale_price differs across its levels.

categorical_variables <- names(Ames)[!sapply(Ames, is.numeric)]
p_values <- sapply(categorical_variables, function(var) {
  anova_result <- aov(as.formula(paste("sale_price ~", var)), data=Ames)
  p_value <- summary(anova_result)[[1]][["Pr(>F)"]][1]
  return(p_value)
})

sorted_vars <- names(p_values)[order(p_values)]
significant_vars <- sorted_vars[p_values[sorted_vars] < 0.01]
print(significant_vars)
##  [1] "neighborhood"   "exter_qual"     "kitchen_qual"   "bsmt_qual"     
##  [5] "garage_finish"  "foundation"     "heating_qc"     "garage_type"   
##  [9] "mas_vnr_type"   "bsmt_fin_type1" "sale_condition" "exterior1st"   
## [13] "exterior2nd"    "bsmt_exposure"  "sale_type"      "ms_zoning"     
## [17] "house_style"    "lot_shape"      "central_air"    "fireplace_qu"  
## [21] "electrical"     "paved_drive"    "roof_style"     "bldg_type"     
## [25] "bsmt_cond"      "land_contour"   "roof_matl"      "condition1"    
## [29] "garage_qual"    "garage_cond"    "exter_cond"     "lot_config"    
## [33] "functional"     "heating"        "fence"



We have 40 categorical variables and 35 of them are significant. However, I still need to narrow the list down: some variables have many levels, and applying One-Hot Encoding to all of them would create too many columns in the recipe.

Explore Individual Categorical variables



We will explore the top 4 categorical variables with the smallest p-values: neighborhood, exter_qual, bsmt_qual, and kitchen_qual.

visualize_cat_feature <- function(data, feature, color) {
  data[[feature]] <- with(data, reorder(data[[feature]], sale_price, median))
  
  ggplot(data, aes_string(x = feature, y = "sale_price")) +
    geom_boxplot(fill = color) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = paste('Sale Price vs.', feature),
         x = feature,
         y = 'Sale Price')
}

p241 <- visualize_cat_feature(Ames, "neighborhood", "green")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p242 <- visualize_cat_feature(Ames, "exter_qual", "blue")
p243 <- visualize_cat_feature(Ames, "bsmt_qual", "red")
p244 <- visualize_cat_feature(Ames, "kitchen_qual", "purple")

grid.arrange(p241, p242, p243, p244, ncol=2)


  1. The median house price varies drastically among the different neighborhoods in Ames. Nonetheless, I believe 25 different neighborhoods are too many and may cause overfitting. Therefore, I will recode the neighborhoods into regions based on the geography of Ames, Iowa.

  2. The quality of a house's exterior material also affects its sale price. When a house has excellent-quality materials, its sale price is likely to be about double that of houses at the other quality levels.

  3. bsmt_qual evaluates the height of the basement, and from our boxplot, houses with excellent basement quality (100+ inches in height) are the most expensive. The other basement-height levels have less effect on the sale price.

  4. Higher kitchen quality also results in higher prices.

At this point, I have gained a sense that this data set contains many measurements of different aspects of house quality. In the steps below, I will explore their distributions across levels to see whether they are worth keeping or should be represented by the single variable overall_qual.

Feature Selection



Now that we have a clearer understanding of some of the issues within the Ames Housing data set, we are going to modify it in this step to prepare it for our models.

Identifying and Removing Outliers



For numerical variables highly correlated with sale_price, I will use z-scores to detect and remove outliers. Any observation more than 3 standard deviations from the mean is considered an outlier. For gr_liv_area and total_bsmt_sf especially, the outliers in the bottom-right corners of the scatterplots are very noticeable.

#plots before removing outliers
plot_feature_vs_sale_price <- function(data, feature) {
  ggplot(data, aes_string(x = feature, y = "sale_price")) +
    geom_point(alpha = 0.5, size = 0.5) +
    theme_minimal() +
    labs(title = paste('Sale Price vs.', feature),
         x = feature,
         y = 'Sale Price')
}
p1 <- plot_feature_vs_sale_price(Ames, "overall_qual")
p2 <- plot_feature_vs_sale_price(Ames, "gr_liv_area")
p3 <- plot_feature_vs_sale_price(Ames, "garage_cars")
p4 <- plot_feature_vs_sale_price(Ames, "full_bath")
p5 <- plot_feature_vs_sale_price(Ames, "total_bsmt_sf")
p6 <- plot_feature_vs_sale_price(Ames, "year_built")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=3)

features <- c('overall_qual', 'gr_liv_area', 'garage_cars', 'total_bsmt_sf', 
               'full_bath', 'year_built')
z_scores <- scale(Ames[, features])
outlier_rows <- apply(z_scores, 1, function(row) any(abs(row) > 3))
Ames <- Ames[!outlier_rows, ]
p7 <- plot_feature_vs_sale_price(Ames, "overall_qual")
p8 <- plot_feature_vs_sale_price(Ames, "gr_liv_area")
p9 <- plot_feature_vs_sale_price(Ames, "garage_cars")
p10 <- plot_feature_vs_sale_price(Ames, "full_bath")
p11 <- plot_feature_vs_sale_price(Ames, "total_bsmt_sf")
p12 <- plot_feature_vs_sale_price(Ames, "year_built")
grid.arrange(p7, p8, p9, p10, p11, p12, ncol=3)


Now we can no longer see outliers in our scatterplots.

Categorical Features


We still have 40 categorical variables, and many of them have uneven distributions across their levels.

c1<-Ames %>% ggplot(aes(x = ms_zoning)) + geom_bar()
c2<-Ames %>% ggplot(aes(x = lot_shape)) + geom_bar()
c3<-Ames %>% ggplot(aes(x = lot_config)) + geom_bar()
c4<-Ames %>% ggplot(aes(x = house_style)) + geom_bar()
c5<-Ames %>% ggplot(aes(x = roof_style)) + geom_bar()
c6<-Ames %>% ggplot(aes(x = exterior1st)) + geom_bar()
c7<-Ames %>% ggplot(aes(x = exterior2nd)) + geom_bar()
c8<-Ames %>% ggplot(aes(x = exter_qual)) + geom_bar()
c9<-Ames %>% ggplot(aes(x = foundation)) + geom_bar()
c10<-Ames %>% ggplot(aes(x = bsmt_qual)) + geom_bar()
c11<-Ames %>% ggplot(aes(x = heating_qc)) + geom_bar()
c12<-Ames %>% ggplot(aes(x = kitchen_qual)) + geom_bar()
c13<-Ames %>% ggplot(aes(x = garage_type)) + geom_bar()
c14<-Ames %>% ggplot(aes(x = sale_condition)) + geom_bar()
grid.arrange(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14, nrow = 3)


For the sparsely populated levels, I will combine their observations into a single level named "Other".

factor_cols <- sapply(Ames, is.factor)
Ames$ms_zoning <- Ames$ms_zoning %>% fct_lump_lowfreq()
Ames$lot_shape <- Ames$lot_shape %>% fct_lump_lowfreq()
Ames$lot_config <- Ames$lot_config %>% fct_lump_n(2)
Ames$house_style<- Ames$house_style %>% fct_lump_n(3)
Ames$roof_style <- Ames$roof_style %>% fct_lump_lowfreq()
Ames$exterior1st<- Ames$exterior1st %>% fct_lump_n(5)
Ames$exterior2nd<- Ames$exterior2nd %>% fct_lump_n(5)
Ames$exter_qual <- Ames$exter_qual %>% fct_lump_lowfreq()
Ames$foundation<- Ames$foundation %>%   fct_lump_n(2)
Ames$heating_qc<- Ames$heating_qc %>%   fct_lump_n(3)
Ames$kitchen_qual<- Ames$kitchen_qual%>%   fct_lump_n(2)
Ames$garage_type<- Ames$garage_type%>%   fct_lump_n(2)
Ames$sale_condition <- Ames$sale_condition %>%  fct_lump_lowfreq()
c11<-Ames %>% ggplot(aes(x = ms_zoning)) + geom_bar()
c22<-Ames %>% ggplot(aes(x = lot_shape)) + geom_bar()
c33<-Ames %>% ggplot(aes(x = lot_config)) + geom_bar()
c44<-Ames %>% ggplot(aes(x = house_style)) + geom_bar()
c55<-Ames %>% ggplot(aes(x = roof_style)) + geom_bar()
c66<-Ames %>% ggplot(aes(x = exterior1st)) + geom_bar()
c77<-Ames %>% ggplot(aes(x = exterior2nd)) + geom_bar()
c88<-Ames %>% ggplot(aes(x = exter_qual)) + geom_bar()
c99<-Ames %>% ggplot(aes(x = foundation)) + geom_bar()
c1010<-Ames %>% ggplot(aes(x = bsmt_qual)) + geom_bar()
c1111<-Ames %>% ggplot(aes(x = heating_qc)) + geom_bar()
c1212<-Ames %>% ggplot(aes(x = kitchen_qual)) + geom_bar()
c1313<-Ames %>% ggplot(aes(x = garage_type)) + geom_bar()
c1414<-Ames %>% ggplot(aes(x = sale_condition)) + geom_bar()
grid.arrange(c11,c22,c33,c44,c55,c66,c77,c88,c99,c1010,c1111,c1212,c1313,c1414, nrow = 3)



For categorical variables with multiple levels, we will drop those with more than 85% of observations in a single level. These include: street, land_contour, utilities, land_slope, condition1, condition2, bldg_type, roof_matl, exter_cond, bsmt_cond, bsmt_fin_type2, heating, central_air, electrical, functional, garage_qual, garage_cond, paved_drive, and sale_type.

Ames <- subset(Ames, select = -c(street, land_contour, utilities, land_slope, condition1, condition2, bldg_type, roof_matl, exter_cond, bsmt_cond, bsmt_fin_type2, heating, central_air, electrical, functional, garage_qual, garage_cond, paved_drive, sale_type))
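The 85% threshold above was applied by reading the bar charts; a base-R sketch of how such dominated variables could be flagged programmatically (shown on a toy data frame, not the full Ames data):

```r
# Flag categorical columns where a single level holds more than 85% of rows.
toy <- data.frame(
  street    = c(rep("Pave", 97), rep("Grvl", 3)),  # 97% in one level -> drop
  lot_shape = c(rep("Reg", 60), rep("IR1", 40)),   # 60% at most      -> keep
  stringsAsFactors = FALSE
)

dominant_share <- function(col) max(table(col)) / length(col)
dominated <- names(toy)[sapply(toy, function(col)
  is.character(col) && dominant_share(col) > 0.85)]
dominated  # "street"
```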

Imputing Missing Values

missing_values <- sapply(Ames, function(x) sum(is.na(x)))
missing_values[missing_values > 0]
##   mas_vnr_type   mas_vnr_area      bsmt_qual  bsmt_exposure bsmt_fin_type1 
##              8              8             36             37             36 
##   fireplace_qu    garage_type  garage_finish          fence 
##            683             76             76           1157


Missing values fall into two categories:

1. According to Kaggle's data description, the "NA" level in some categorical variables means "does not have". I will therefore replace "NA" with "None".
2. The numerical variable mas_vnr_area will be imputed with its mode.


#categorical variables
cate_to_replace <- c( 'mas_vnr_type','bsmt_qual',  'bsmt_exposure', 'bsmt_fin_type1', 
                       'fireplace_qu', 'garage_type', 'garage_finish', 'fence')
Ames[cate_to_replace] <- lapply(Ames[cate_to_replace], function(col) {
  ifelse(is.na(col), 'None', col)
})

#numerical variables
mode_value <- as.numeric(names(sort(table(Ames$mas_vnr_area), decreasing=TRUE)[1]))
Ames$mas_vnr_area[is.na(Ames$mas_vnr_area)] <- mode_value
#categorical to factor
Ames[sapply(Ames, is.character)] <- lapply(Ames[sapply(Ames, is.character)], as.factor)
missing_values <- sapply(Ames, function(x) sum(is.na(x)))
missing_values[missing_values > 0]
## named integer(0)


All missing values are handled.

The ‘neighborhood’ Variable

library(forcats)
Ames %>% 
  ggplot(aes(y = forcats::fct_infreq(neighborhood))) +
  geom_bar(fill='#003399') +
  theme_base() +
  ylab("Neighborhood")



The neighborhood variable has 25 levels with a fair number of observations in each, so we will recode it into regions according to the map of Ames, Iowa.

Ames Neighborhood
Ames <- Ames%>%
    mutate(region = forcats::fct_collapse(neighborhood,
                                        west = c("SawyerW", "Sawyer", "ClearCr", "Edwards",
                                                 "CollgCr","Crawfor", "SWISU", "Blueste",'Timber'),
                                        north = c("NridgHt", "NoRidge", "Gilbert", "NWAmes",
                                                      "Somerst", "Veenker", "Blmngtn", "BrkSide",
                                                    "NPkVill",'BrDale','StoneBr','NAmes','OldTown',
                                                  'IDOTRR'),
                                        southeast = c("Mitchel", "MeadowV"))) %>%
  select(-neighborhood)
Ames %>% 
  ggplot(aes(y = forcats::fct_infreq(region))) +
  geom_bar(fill='#003399') +
  theme_base() +
  ylab("Region")



Now that I have recoded the neighborhood variable, I want to check whether my earlier conjecture - that 3-car-garage houses are the most expensive because of their location - is reasonable.

Ames %>% 
  ggplot(aes(x = as.factor(region), y = sale_price)) +
  geom_boxplot(aes(fill = as.factor(garage_cars))) +
  theme_minimal() +
  labs(x = "Regions", y = "Sale Price", fill = "garage_cars", title = "Sale Price by Garage Capacity and Region") +
  theme(legend.position = "bottom")



In the boxplot, houses in the north region with 3-car garages are the most expensive of all houses in Ames, Iowa. Across garage sizes, north-region houses are generally at least as expensive as those in the other regions, and the maximum prices are also higher in the north.

Ames %>% ggplot(aes(y = sale_price, fill = region,
                          x = factor(ms_zoning))) +
  stat_summary(fun = "mean", geom = "bar", 
               position = "dodge") + theme_bw()


MSZoning: Identifies the general zoning classification of the sale.

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density
   

The north region mostly consists of commercial and floating village residential zones, which explains why this region has overall higher sale prices than the others. The southeast region is mainly low- and medium-density residential zoning; with such a uniform occupant type, this explains the small range of house prices in the southeast region.

Model Preparations



After I prepared the data and gained understanding of the variables in the Ames Housing data set, I will set up the recipe and create folds for cross validation for my models.

Data Split



I apply a log transformation to the response variable, sale price, because the earlier histogram showed its right skew. Additionally, sale prices are mostly very large numbers, and this would affect the evaluation of our model performance.



As for the splitting, I am choosing a 70/30 split because, as mentioned in the introduction, this data set was originally the training set for the Kaggle competition, so 30% of the roughly 1,400 remaining observations should be adequate for testing the models. Lastly, to ensure an even distribution of the response variable in the training and testing data, I stratified the split on sale price.

Ames <- Ames %>% mutate(sale_price = log10(sale_price))
hist(Ames$sale_price)

ames_split <- initial_split(Ames,prop=0.7,strata=sale_price)
ames_split
## <Training/Testing/Total>
## <1001/431/1432>
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

Recipe Creation



I will use a simple recipe for the linear regression model because the correlation coefficients between its numerical predictors and sale price are very high. For the other models, I will use one shared recipe. Categorical variables will be dummy-coded in the recipe creation step. I will also standardize all predictors by centering and scaling them, since the numerical variables have different ranges and standardizing makes the model performances easier to compare.

ames_recipe <- recipe(sale_price ~ ., data = ames_train) %>%
  #dummy coding nominal variables
  step_dummy(all_nominal_predictors())%>%
  #step_zv(all_predictors()) %>%
  #normalize
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
#prep and bake
prep(ames_recipe) %>% 
  bake(new_data = ames_train) %>% 
  head() 
## # A tibble: 6 × 70
##   overall_qual year_built mas_vnr_area total_bsmt_sf gr_liv_area full_bath
##          <dbl>      <dbl>        <dbl>         <dbl>       <dbl>     <dbl>
## 1        0.704     -1.37        -0.590       -0.214        0.609     0.846
## 2       -0.781     -1.10        -0.590       -0.113       -0.878    -1.03 
## 3       -0.781     -0.210       -0.590        0.0143      -0.957    -1.03 
## 4       -1.52      -0.142       -0.590       -2.68        -0.411     0.846
## 5       -1.52      -1.50        -0.590       -1.33        -2.07     -1.03 
## 6       -1.52      -1.74        -0.590       -0.998       -0.366    -1.03 
## # ℹ 64 more variables: fireplaces <dbl>, garage_cars <dbl>, sale_price <dbl>,
## #   ms_zoning_Other <dbl>, lot_shape_Other <dbl>, lot_config_Inside <dbl>,
## #   lot_config_Other <dbl>, house_style_X1Story <dbl>,
## #   house_style_X2Story <dbl>, house_style_Other <dbl>, roof_style_Other <dbl>,
## #   exterior1st_MetalSd <dbl>, exterior1st_Plywood <dbl>,
## #   exterior1st_VinylSd <dbl>, exterior1st_Wd.Sdng <dbl>,
## #   exterior1st_Other <dbl>, exterior2nd_MetalSd <dbl>, …
linear_recipe <- recipe(sale_price ~ overall_qual + year_built +
                        total_bsmt_sf + gr_liv_area + house_style, data = ames_train) %>% 
  step_dummy(house_style) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
prep(linear_recipe) %>% 
  bake(new_data = ames_train) %>% 
  head()
## # A tibble: 6 × 8
##   overall_qual year_built total_bsmt_sf gr_liv_area sale_price
##          <dbl>      <dbl>         <dbl>       <dbl>      <dbl>
## 1        0.704     -1.37        -0.214        0.609       5.11
## 2       -0.781     -1.10        -0.113       -0.878       5.07
## 3       -0.781     -0.210        0.0143      -0.957       5.11
## 4       -1.52      -0.142       -2.68        -0.411       4.95
## 5       -1.52      -1.50        -1.33        -2.07        4.84
## 6       -1.52      -1.74        -0.998       -0.366       4.60
## # ℹ 3 more variables: house_style_X1Story <dbl>, house_style_X2Story <dbl>,
## #   house_style_Other <dbl>

K-Fold Cross Validation


We will stratify on the outcome sale_price and use 10 folds to perform cross validation.

ames_folds <- vfold_cv(ames_train, v=10, strata = sale_price)

Model Building



We will use 5 models in total: Linear Regression, K-Nearest Neighbors, Elastic Net Regression, Random Forest, and Gradient-Boosted Trees. I decided to use elastic net regression instead of a polynomial regression model for its ability to handle multicollinearity within my data set. In addition, the previous visualizations suggest the predictor-response relationships are likely linear; polynomial regression is not needed to capture linear trends and is therefore excluded.

For evaluation, I will use the RMSE (Root Mean Squared Error). Since the RMSE measures the distance from the predictions to the actual values, the lower the RMSE, the better the model performs.
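For reference, the RMSE that the tuning results report can be written out by hand; a minimal base-R sketch:

```r
# RMSE: square root of the mean squared difference between actual and predicted.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(5.0, 5.1, 5.2), c(5.0, 5.1, 5.2))  # perfect predictions -> 0
rmse(c(0, 0), c(3, 4))                    # sqrt((9 + 16) / 2) = sqrt(12.5)
```

Since sale_price was log10-transformed, the RMSEs below are on the log10-dollar scale, which is why they are such small numbers.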

  1. Setting up models
#Linear Regression
lm_mod <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

#K Nearest Neighbors
knn_mod <- nearest_neighbor(neighbors = tune())%>%
  set_mode("regression")%>%
  set_engine("kknn")
#Elastic Net
en_reg_spec <- linear_reg(penalty = tune(),mixture = tune()) %>%
  set_mode("regression")%>%
  set_engine("glmnet")
#Random Forest
rf_reg_spec <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")
#Gradient_Boosted Trees
bt_reg_spec <- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("regression")
  2. Setting up workflows
#Linear Regression
lm_workflow <- workflow()%>%
  add_model(lm_mod)%>%
  add_recipe(linear_recipe)
#K Nearest Neighbors
knn_workflow <- workflow()%>%
  add_model(knn_mod)%>%
  add_recipe(ames_recipe)
#Elastic Net
en_workflow <- workflow()%>%
  add_model(en_reg_spec)%>%
  add_recipe(ames_recipe)
#Random Forest
rf_workflow <- workflow()%>%
  add_model(rf_reg_spec)%>%
  add_recipe(ames_recipe)
#Gradient_Boosted Trees
bt_workflow <- workflow()%>%
  add_model(bt_reg_spec)%>%
  add_recipe(ames_recipe)
  3. Creating tuning grids
#K Nearest Neighbors
knn_grid <- grid_regular(neighbors(range=c(1,20)),levels=10)
#Elastic Net
en_grid <- grid_regular(penalty(range = c(0,0.1),trans = identity_trans()),mixture(range = c(0, 1)),levels = 10)
#Random Forest
rf_grid <- grid_regular(mtry(range = c(2, 8)), 
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 4)
#GBT
bt_grid <- grid_regular(mtry(range = c(2, 8)), 
                        trees(range = c(100, 400)),
                        learn_rate(range = c(0.01, 0.1)),
                        levels = 5)


Elaborate on Parameters:

K Nearest Neighbors: I chose a range of 1 through 20 neighbors (at 10 levels) because a smaller number of neighbors can better capture the patterns within the data but is also more sensitive to outliers and noise; a larger number of neighbors behaves the other way around.

Elastic Net: it combines the regularization of both lasso and ridge. For the penalty, I picked the range 0 to 0.1, because larger values mean more regularization: regularization helps prevent overfitting by penalizing large coefficients, but too much oversimplifies the model. For the mixture, I chose 0 to 1 so that I can compare every blend between pure ridge (0) and pure lasso (1).
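To make the mixture parameter concrete, here is the elastic net penalty term as glmnet defines it, in a small base-R sketch (the coefficient vector beta is made up for illustration):

```r
# glmnet's elastic net penalty:
#   penalty * ((1 - mixture)/2 * sum(beta^2) + mixture * sum(abs(beta)))
# mixture = 0 gives pure ridge (L2); mixture = 1 gives pure lasso (L1).
en_penalty <- function(beta, penalty, mixture) {
  penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
}

beta <- c(0.5, -0.2, 0.1)                     # hypothetical coefficients
en_penalty(beta, penalty = 0.1, mixture = 0)  # pure ridge -> 0.015
en_penalty(beta, penalty = 0.1, mixture = 1)  # pure lasso -> 0.08
```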

Random Forest: a range of 2 to 8 for mtry makes sense because it lets the model test various subsets of predictors at each split, which can help find an optimal subset. For trees, 200 to 600 is a common, standard range that ensures robust learning without reaching the point where adding more trees no longer improves the model. The minimum number of data points required in a node for it to be split further (min_n) is set to 10 to 20; a smaller value would allow deeper trees, potentially capturing more intricate patterns but at the risk of overfitting, so 10 to 20 provides a balanced approach. A level of 4 means there are 4^3 = 64 combinations of parameters for the model.

Gradient-Boosted Trees: most parameter ranges are similar to the random forest model; by choosing 2 to 8 for mtry, I am testing various subsets of predictors to find an optimal subset for each individual tree in the boosting process. The learning rate range is 0.01 to 0.1.
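As a sanity check on the grid sizes, a regular grid with 3 tuned parameters at 4 levels each lays out a full factorial of 4^3 = 64 rows; expand.grid() is the base-R analogue of what grid_regular() produces (the level values below are illustrative):

```r
# Full factorial layout: every combination of each parameter's levels.
rf_grid_sketch <- expand.grid(
  mtry  = c(2, 4, 6, 8),
  trees = c(200, 333, 467, 600),
  min_n = c(10, 13, 17, 20)
)
nrow(rf_grid_sketch)  # 4 * 4 * 4 = 64 candidate models per resample
```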

4. Tuning models and saving the results

# K Nearest Neighbors
knn_tune <- tune_grid(
    knn_workflow,
    resamples = ames_folds,
    grid = knn_grid
)
save(knn_tune, file = "knn_tune.rda")
# Elastic Net
en_tune <- tune_grid(
  en_workflow,
  resamples = ames_folds,
  grid = en_grid,
  control = control_grid(verbose = TRUE)
)
save(en_tune, file = "en_tune.rda")
# Random Forest
rf_tune_res <- tune_grid(
  rf_workflow,
  resamples = ames_folds,
  grid = rf_grid,
  control = control_grid(verbose = TRUE)
)
save(rf_tune_res, file = "rf_tune_res.rda")
#GBT
bt_tune_res <- tune_grid(
  bt_workflow,
  resamples = ames_folds,
  grid = bt_grid,
  control = control_grid(verbose = TRUE)
)
save(bt_tune_res, file = "bt_tune_res.rda")

Model Results

load("knn_tune.rda")
load("en_tune.rda")
load("rf_tune_res.rda")
load("bt_tune_res.rda")

# Linear Regression
lm_res <- lm_workflow %>% fit_resamples(ames_folds)
lm_rmse <- collect_metrics(lm_res) %>% filter(.metric == "rmse") %>% slice(1)

# KNN
knn_rmse <- collect_metrics(knn_tune) %>% filter(.metric == "rmse") %>% arrange(mean) %>% slice(1)

# Elastic Net
en_rmse <- collect_metrics(en_tune) %>% filter(.metric == "rmse") %>% arrange(mean) %>% slice(1)

# Random Forest
rf_rmse <- collect_metrics(rf_tune_res) %>% filter(.metric == "rmse") %>% arrange(mean) %>% slice(1)

# Boosted Trees
bt_rmse <- collect_metrics(bt_tune_res) %>% filter(.metric == "rmse") %>% arrange(mean) %>% slice(1)

# Creating a tibble of all the models and their RMSE
final_compare_tibble <- tibble(
  Model = c("Linear Regression", "K Nearest Neighbors", "Elastic Net", "Random Forest", "Boosted Trees"),
  RMSE = c(lm_rmse$mean, knn_rmse$mean, en_rmse$mean, rf_rmse$mean, bt_rmse$mean)
)

# Arranging by lowest RMSE
final_compare_tibble <- final_compare_tibble %>% 
  arrange(RMSE)

final_compare_tibble
## # A tibble: 5 × 2
##   Model                 RMSE
##   <chr>                <dbl>
## 1 Elastic Net         0.0618
## 2 Random Forest       0.0652
## 3 Linear Regression   0.0707
## 4 K Nearest Neighbors 0.0810
## 5 Boosted Trees       0.0890


The Elastic Net model has the lowest RMSE, about 0.0618. Let’s also visualize the RMSEs side by side.

# Creating a data frame of the model RMSE's
all_models <- data.frame(
  Model = c("Linear Regression", "K Nearest Neighbors", "Elastic Net", "Random Forest", "Boosted Trees"),
  RMSE = c(lm_rmse$mean, knn_rmse$mean, en_rmse$mean, rf_rmse$mean, bt_rmse$mean)
)

# Plotting the RMSE values using a bar chart
ggplot(all_models, aes(x=Model, y=RMSE)) +
  geom_bar(stat = "identity", aes(fill = Model)) +
  scale_fill_manual(values = c("lightblue", "lightcoral", "lightsalmon", "lightgreen", "lightpink")) +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Comparing RMSE by Model")

Elastic Net Plot

autoplot(en_tune,metric = 'rmse')
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis



For the elastic net model, the scale of the RMSE axis is relatively small. This indicates that performance doesn’t vary drastically across any of the candidate models we’ve fit. The amount of regularization, on the x-axis, is the penalty hyperparameter, which covers the range of values we specified (0 to 0.1), and the values of mixture are represented by the different-colored lines. A smaller penalty and mixture produce a better RMSE across the 10 folds. As the penalty increases, the model does worse because the coefficients of the predictors are shrunk to very small values, which makes it much more difficult for the model to predict accurately.

Overall, the models with little to no lasso penalty, i.e., mixtures at or near zero (ridge-like regression), do better, as indicated by the red line (mixture = 0) sitting consistently below the others in RMSE. This implies that it yields better performance to avoid reducing predictor coefficients all the way down to zero, as can happen in the case of lasso regression.

Gradient Boosted Trees Autoplot

autoplot(bt_tune_res,metric = 'rmse')



For the boosted tree model, across the 5 levels, the RMSE tends to decrease as the number of randomly selected predictors increases. The learning rate and model performance have an inverse relationship: the larger the learning rate, the worse the model learned. The number of trees does not make a significant impact on model performance. Overall, from the graph, 4 to 6 randomly selected predictors work best, especially when the learning rate is low. A high learning rate causes the model to learn faster, but it also makes the model less generalized.

Random Forest Autoplot

autoplot(rf_tune_res,metric = 'rmse')



For the random forest model, the higher the number of randomly selected predictors (mtry), the lower the RMSE: as mtry increases, model performance improves. The number of trees doesn’t make much of a difference overall, as those lines are virtually on top of each other, and the minimal node size makes little impact as well, though a min_n of 10 seems to produce slightly better results than a minimum size of 20.

Since the elastic net regression model performed the best out of the 5 models, let’s pull up the parameters of the best elastic net model.

en_tune %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean) %>% 
  slice(1)
## # A tibble: 1 × 8
##   penalty mixture .metric .estimator   mean     n std_err .config               
##     <dbl>   <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                 
## 1  0.0111   0.111 rmse    standard   0.0618    10 0.00180 Preprocessor1_Model012



The low penalty and mixture values suggest that little regularization was needed, indicating my predictors are not as collinear as I assumed. The weak regularization might also be due to the outlier removal in an earlier step.

Analyze Best Model Result


Elastic Net is our best model, so we will fit it to the training data and then evaluate it on the testing data.

# Fitting to the training data
best_en_train <- select_best(en_tune, metric = 'rmse')
en_final_workflow_train <- finalize_workflow(en_workflow, best_en_train)
en_final_fit_train <- fit(en_final_workflow_train, data = ames_train)
# Creating the predicted vs. actual value tibble
ames_tibble <- predict(en_final_fit_train, new_data = ames_test %>% select(-sale_price))
ames_tibble <- bind_cols(ames_tibble, ames_test %>% select(sale_price))

ames_tibble %>% 
  ggplot(aes(x = 10^.pred, y = 10^sale_price)) +
  geom_point(alpha = 0.4) +
  geom_abline(lty = 2) +
  theme_grey() +
  coord_obs_pred() +
  labs(title = "Predicted Values vs. Actual Values")



Most points lie on the line y = x, suggesting the elastic net regression model did well in predicting the majority of sale prices in the testing data. The few outliers in the top right corner show that the model underestimates the sale prices of more expensive houses. My explanation for this is that there wasn’t enough data on high-end houses to train the models on.

# make predictions for the testing data and take a look at its testing RMSE:

best_en_reg <- select_best(en_tune,metric = 'rmse')
final_en_model <- finalize_workflow(en_workflow, best_en_reg)
final_en_model <- fit(final_en_model, ames_train)
final_en_model_test <- augment(final_en_model, ames_test)
rmse(final_en_model_test, truth = sale_price, .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0590
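Since sale_price was modeled on a log10 scale (hence the 10^ back-transform in the plot above), this RMSE of 0.0590 is in log10-dollar units. As a sketch, assuming that log10 transform, the predictions could be converted back to dollars and the RMSE recomputed on the original scale:

```r
# RMSE in dollars rather than log10(dollars); assumes sale_price was log10-transformed
final_en_model_test %>%
  mutate(pred_dollars = 10^.pred,
         actual_dollars = 10^sale_price) %>%
  rmse(truth = actual_dollars, estimate = pred_dollars)
```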



A variable importance (VIP) plot lets us further examine the important predictors.

en_final_fit_train %>% 
  extract_fit_engine() %>% 
  vip(aesthetics = list(fill = "red3"))



This plot tells us that the two most useful predictors of sale price in this elastic net model are gr_liv_area and overall_qual.

Conclusion

After modeling and analysis of the Ames Housing data set, the Elastic Net model emerged as the most effective in predicting house prices. This was anticipated since Elastic Net combines the strengths of both Ridge and Lasso regression, allowing it to handle multicollinearity efficiently while also performing feature selection. Despite its strengths, the model’s predictions, as indicated by the RMSE, suggest there’s room for improvement.

On the other end of the spectrum, the K-Nearest Neighbors (KNN) model performed the least effectively. The challenge with KNN, especially in a dataset like ours with numerous predictors, is the curse of dimensionality. As dimensions increase, data points tend to drift apart, making it harder for KNN to find close neighbors, thus affecting prediction accuracy.

As someone who has always been interested in the real estate market, this experience provided me with a way to practice my machine learning techniques on prepared data. In the future, I want to go deeper into the data and possibly explore other variables not present in the current dataset that might impact house prices. For instance, understanding more about a neighborhood’s amenities, such as proximity to schools, parks, or public transport, could provide further insights. Another intriguing aspect to consider might be the historical significance of some houses or whether they’re located in a heritage precinct, as these factors can significantly influence price.

Nonetheless, a lingering question concerns the decision to transform certain variables and the choice of polynomial terms for some predictors. Experimenting with different transformations or adding interactions could potentially enhance the model’s performance.

If I were to further this research, I would also be keen to compare the Ames housing dataset with housing data from other regions or cities. It would be captivating to discern if house pricing trends and influential factors remain consistent across different geographical locales.

All in all, this journey into predicting house prices with the Ames data set has been both challenging and enlightening. It’s been a rigorous exercise in refining machine learning skills and diving deep into data analysis. In particular, I became aware of the importance of avoiding train-test leakage, and I see areas where my skills can improve through this project. While the Elastic Net model may not be flawless, it’s rewarding to have developed a model that provides a reasonable understanding of the factors influencing house prices.